Chromosome-level genome assembly of the northern Pacific seastar Asterias amurensis

Asterias amurensis has attracted widespread concern because of its population outbreaks, which has impacted fisheries and aquaculture, as well as disrupting local ecosystems. A high-quality reference genome is necessary to better investigate mechanisms of outbreak and adaptive changes. Combining PacBio HiFi and Hi-C sequencing data, we generated a chromosome-level A. amurensis genome with a size of 491.53 Mb. The contig N50 and scaffold N50 were 8.05 and 23.75 Mb, respectively. The result of BUSCO analysis revealed a completeness score of 98.85%. A total of 16,531 protein-coding genes were predicted in the genome, of which 94.63% were functionally annotated. The high-quality genome assembly resulting from this study will provide a valuable genetic resource for future research on the mechanism of population outbreaks and invasion ecology.


Background & Summary
Asterias amurensis (class: Asteroidea), also known as the northern Pacific seastar, is widely distributed in the northwest and northeast Pacific, native to the coast of Alaska 1 , China 2 , Japan 3 , Korea 4 , and Russia 5 .As a benthic echinoderm with distinct evolutionary classification 6 , its reproduction mode includes not only dioecious but also asexual reproduction by arm regeneration 7,8 .Females have high fecundity and can annually spawn ~20 million eggs 3 .The planktonic stage of larva can last for seven weeks or several months, which enables them to rapidly spread in a suitable environment 9,10 .A. amurensis is located at the highest trophic level among the benthic invertebrates as a voracious and efficient generalist predator 11 , which has been reported to impact a variety of infaunal taxa, especially commercial bivalves [12][13][14] .And it has even been associated with the decline of some fish species 15 .
In the early 1980s, free-spawning starfish A. amurensis were first spotted in southeast Tasmania of Australia, possibly introduced from central Japan through ship ballast water 3 .Since their first detection, this starfish has successfully established populations in a short period and gradually expanded to Victoria [16][17][18] .As one of the most successful invasive species, A. amurensis became a significant threat to native assemblages, marine commercial species, and has damaged native ecosystems in Australia 13,19 .Thus, this starfish was listed as one of the high-priority marine pests in Australia 20 .Although its invasive range is limited in Australia 21 so far, A. amurensis will likely continue to expand due to its high fecundity, wide environmental tolerance, and long larval duration 22 , even invading the Southern Ocean 23 .However, due to the lack of genomic information in A. amurensis, genetic changes associated with invasive lineages remain unknown 16,24 .
Periodic and massive outbreaks of A. amurensis populations have been reported in several countries, including Australia, China, and Japan, which have significantly impacted fishery and mariculture grounds, as well as destroyed the original ecological balance, leading to serious economic losses [25][26][27] .Unfortunately, no effective bio-control method has been reported for this pest up to now.To provide warning information for possible outbreaks of A. amurensis, early detection technologies have been developed based on targeting rRNA 28 and the mitochondrial cytochrome c oxidase subunit I (COI) gene 21,29,30 .However, the mechanism of aggregation and outbreak is complex and unclear.Relevant studies require the support of a high-quality genome assembly, which may help to identify species-specific factors associated with aggregating starfish 31 .
In the present study, a de novo assembled chromosome-level A. amurensis genome was prepared using PacBio HiFi and Hi-C sequencing data.The final genome size was 491.53 Mb with scaffold N50 of 23.75 Mb.Using three approaches for gene structure annotation, we identified a total of 16,531 protein-coding genes, of which 15,643 genes were functionally annotated with at least one public database.A high-quality reference genome for A. amurensis will be a useful genomic resource to explore both the mechanism of population outbreak and the genetic basis underlying adaptive change during the invasion process.Meanwhile, the A. amurensis genome will be a noteworthy addition to the existing suite of Asteroidea genomes for future cell, developmental and evolutionary biology research.

Sample collection.
All samples used in this study were from a male adult A. amurensis collected by diving in Qingdao, Shandong Province, China (36°03′04″N, 120°21′26″E) in November 2022.Fresh gonad tissue from the base of the arm was excised and washed with phosphate buffered saline (PBS, 1X).It was then immediately frozen in liquid nitrogen and transferred to −80 °C for storage.High quality DNA was extracted from gonad using DNeasy Blood & Tissue Kit (Qiagen, Germany) for long-read and short-read whole genome sequencing.To aid in structural annotation, nine tissues including gonad, body wall, madreporite, spine, mouth, stomach, muscle, podia, and eye spot were used for transcriptome sequencing.All tissues were isolated separately with scissors and forceps, and then treated in the same way as the gonad collection.Total RNA was extracted using the TRIzol reagent (Vazyme, China).
Sequencing.For long-read sequencing, high molecular weight genomic DNA (gDNA) was fragmented to approximately 15 kb to construct a PacBio HiFi library.The sequencing library was generated using the SMRTbell Express Template Prep kit 2.0 (Pacific Biosciences, USA), following the manufacturer's recommendations, as described in the previous study 32 .The library was finally sequenced with circular consensus sequencing (CCS) mode on the PacBio Sequel II system using a single 8 M cell.After filtering out the low-quality reads and sequence adapters, a total of 11.15 Gb CCS data were obtained with a mean length of 12.51 kb (Table 1).
For short-read whole genome sequencing, gDNA was fragmented into approximately 350 bp for library construction.The library was sequenced on DNBSEQ-T7 platform to generate 150 bp paired-end (PE150) reads.After filtering out low-quality reads including reads shorter than 100 bp, reads that contained >10% "N", and   reads that contained >50% low-quality bases (Phred score ≤10), the clean data generated was 112.58 Gb, which covered ~229X of the genome (Table 1).
The chromosome conformation capture (Hi-C) technique was employed to assemble a chromosome-level genome.The fresh gonad was crosslinked using formaldehyde solution and digested with four-cutter restriction enzyme (DpnII).The ends of the restriction fragments were labeled with biotinylated nucleotides, and then the ligated DNA was sheared into fragments from 300 bp to 700 bp in length for Hi-C library construction.The resulting library was quantified with the Q-PCR method and sequenced with the DNBSEQ-T7 platform.After removing adapters and low-quality short reads, a total of 102.75 Gb (209.04 × coverage) of clean data was generated, with Q20 = 97.32% and Q30 = 92.33%(Table 2).
For transcriptome sequencing, total RNA of nine tissues from the same starfish was extracted and equally pooled for cDNA library construction.The resulting library was constructed by NEBNext ® Ultra ™ RNA Library Prep Kit (NEB, USA) according to the manufacturer's instructions and sequenced on Illumina NovaSeq6000 system, finally generating 13.47 Gb clean data to help genome structure annotation.

Genome assembly.
Based on PaciBio HiFi reads, Hifiasm (v0.18.4) 33 was applied for de novo assembly of primary contigs with default parameters.Haplotypic and heterozygous duplication was removed using purge_ dups (v1.2.6) 34 with the parameter of cutoffs '-l 5 -m 18 -u 54' .A primary assembly was generated, consisting of 90 contigs spanning 491.50 Mb.N50 and the maximum contig length were 8.05 and 28.59 Mb, respectively (Table 3).We further scaffolded the contigs using Hi-C sequencing data to obtain a high-quality chromosome-scale genome.Juicer (v1.6) 35 was applied for raw sequence data analysis and then 3D-DNA (v190716) 36 was used to anchor contigs into chromosomes.The assembly was further corrected manually according to the Hi-C heatmap using JuiceboxGUI (v1.11.08) 37 , a visualization system for Hi-C contact maps.The final genome consisted of 22 chromosomes with lengths ranging from 13.43 to 38.00 Mb, and the N50 was 23.75 Mb (Table 3, Fig. 1, Fig. 2).Previous karyotype analysis 38 of A. amurensis indicated that it had a diploid chromosome number of 44, which was consistent with our results.

Gene prediction and functional annotation.
We used three approaches for predictions of gene structures, including de novo, homology-based, and RNA-seq-based prediction.Augustus (v3.4.0) 45 , GlimmerHMM   50 with default parameters to predict gene structures.For RNA-seq-based prediction, we firstly mapped short RNA reads to reference genome using HISAT2 (v2.2.1) 51 with the parameter '-dta' and then assembled transcripts using StringTie (v2.2.1) 52 .Meanwhile, the Program to Assemble Spliced Alignments (PASA, v2.4.1) pipeline (https:// github.com/PASApipeline/PASApipeline)was used to identify possible coding regions based on de novo transcriptome assembled by Trinity (v2.14.0) 53 with default parameters.Then, EvidenceModeler (EVM, v1.1.1) 54and Funannotate (v1.8.14) pipeline (https://github.com/nextgenusfs/funannotate)were applied for combining predicted results from three strategies and removal of low-quality gene annotations.Based on the RNA-seq data of A. amurensis from this study, adult stomach tissue 55 , and bipinnaria larval 16 from other studies, PASA (v2.4.1) was applied for the update of untranslated regions (UTRs).The general annotation pipeline applied in the present study was shown in Fig. 3.As a result, a total of 16,531 protein-coding genes were predicted and the average gene length was 17,803.19bp, with an average coding sequence (CDS) length of 1,885.87bp and average exon number of 10.07 (Table 6).Among them, 12,736 (77.04%) genes were supported by evidence from all three strategies Fig. 3 The general annotation pipeline of repetitive elements, ncRNAs, and protein-coding genes.
(Fig. 4).We also counted the density of genes on different chromosomes with a window of 1 Mb in length (Fig. 5) and simply compared gene length, CDS length, exon length, intron length and exon number per gene of A. amurensis and other species used in homology-based predictions (Fig. 6).The 1 Mb region with the largest number of annotated genes were from the end of chromosome 18 (Fig. 5).

Comparative genomic analysis.
The longest protein sequences of A. amurensis and other five asteroid species including Acanthaster sp. 63, Asterias rubens 64 , Patiria miniata 65 , Plazaster borealis 66 , and Zoroaster cf.ophiactis 67 were utilized to identify orthologous groups using OrthoFinder (v2.5.5) 68 with the parameters '-S diamond' , and the sea urchin Lytechinus variegatus 69 was selected as an outgroup.A total of 5,315 single-copy orthogroups were obtained for subsequent phylogenetic analysis.Based on multiple sequence alignments of the single-copy orthogroups using MAFFT (v7.520) 70 , IQ-TREE (v2.2.3) 71 was applied for construction of the species trees with the parameters '-m MFP -bb 1000' and the best model of GTR + F + I + R4.Predictably, A. amurensis was most closely related to A. rubens and P. borealis from the family Asteriidae (Fig. 8).Then, divergence times were estimated using MCMCTREE in PAML (v4.9i) 72 based on the divergence time (A.amurensis vs L. variegatus: 461.1.5-600.0 million years ago) extracted from TIMETREE (http://www.timetree.org/).The expansion and contraction of gene families were analyzed by Computational Analysis of gene Family Evolution (CAFE, v5.0.0)   with a p-value of 0.05.The results revealed that 197 and 482 gene families were expanded and contracted in A. amurensis, respectively (Fig. 8).

technical Validation
Nucleic acid quality.The concentration and quality of DNA were evaluated using Nanodrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) and agarose gel electrophoresis, respectively.RNA integrity was assessed using Agilent 2100 Bioanalyzer (Agilent Technologies, USA).

Genome assembly and annotation quality evaluation.
The quality of the final chromosome-level genome assembly was assessed using four methods as follows.Firstly, we mapped clean PE150 reads from whole genome sequencing to A. amurensis genome using BWA-MEM (v0.7.17) 80 and calculated the mapping rate using samtools (v1.9)

Fig. 1
Fig. 1 Genome-wide heatmap of Hi-C interactions among 22 chromosomes in A. amurensis.The scale bar represents the interaction frequency of Hi-C links.

Fig. 2
Fig. 2 Circos plot of genomic features in A. amurensis genome.The tracks from outside to inside indicate: (1) length of 22 chromosomes (Mb), (2) distribution of GC content with a window of 1 Mb, (3) distribution of repeat elements with a window of 1 Mb, (4) distribution of ncRNAs with a window of 1 Mb, and (5) distribution of protein-coding genes with a window of 1 Mb.

Fig. 4
Fig.4 Venn diagram of gene structure prediction from de novo, homology-based and RNA-seq-based strategies.

Fig. 5
Fig. 5 Number of genes on 22 chromosomes with a window of 1 Mb.The scale bar represents the density of genes.

Fig. 6
Fig. 6 Comparisons of CDS length, exon length, exon number per gene, gene length and intron length among A. amurensis and other relative species.

Fig. 7 Fig. 8
Fig. 7 Upset plot and Venn diagram of functional annotation for protein-coding genes based on different databases, including InterPro, KEGG, Nr, Pfam, and Swiss-Prot.

Table 1 .
Statistical analysis of sequencing reads from BGI, Illumina and PacBio.

Table 2 .
Statistical analysis of sequencing data from Hi-C.

Table 3 .
Assembly statistics of A. amurensis genome.

Table 4 .
Classification of repetitive sequences in A. amurensis genome.

Table 6 .
Statistical results of the gene structure annotation in A. amurensis genome.

Table 7 .
84, resulting in a genome coverage rate of 99.95% and a mapping rate of 99.61%.Secondly, the results of Benchmarking Universal Single-Copy Orthologs (BUSCO, v5.2.2) 82 analysis based on 954 genes of metazoa_odb10 database indicated that 951 (99.69%) core metazoan genes were detected in A. amurensis genome, consisting of 943 (98.85%) complete and 8 (0.84%) fragmented genes (Table8).Thirdly, the Core Eukaryotic Genes Mapping Approach (CEGMA, v2.5)83based on 248 core eukaryotic genes showed that 236 (95.16%) genes were identified in the final genome assembly. Fi, meryl (v1.3)84was used to generate k-mer counts based on paired-end reads generated by whole genome sequencing, and Merqury (v1.3)84was utilized to estimate the consensus quality value (QV) of A. amurensis genome, resulting in a QV of 48.51.The results from the four methods above revealed the high accuracy and completeness of the final genome assembly.Summary of the functional gene annotation in A. amurensis genome.

Table 8 .
BUSCO evaluation of gene annotation in A. amurensis genome.